# SBOM Toolkit Dataset Manifest

Generated: 2025-12-31 18:59:56 UTC

## Overview

This dataset accompanies the SBOM Toolkit research project, which applies graph
neural networks to software supply chain security analysis. The dataset contains
Software Bills of Materials (SBOMs), vulnerability scan results, trained model
checkpoints, and evaluation metrics.

## Archives

| Archive | Description | Compressed size | SHA256 (truncated) |
|---------|-------------|------|--------|
| `evaluations.tar.gz` | Model evaluation results and metrics | 2.7 MB | `1682994cea5deff2...` |
| `models.tar.gz` | Trained GNN model checkpoints | 751.7 KB | `42a8ecc5acb7c8dd...` |
| `reference_data.tar.gz` | Attack chain data and vulnerability caches | 2.9 MB | `4d80a7243d4c6ef2...` |
| `sboms.tar.gz` | Software Bills of Materials (CycloneDX JSON format) | 30.9 MB | `5988e279e9a1db84...` |
| `scans.tar.gz` | Enriched vulnerability scan results | 7.4 MB | `6de91cbe682c895a...` |
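
A downloaded archive can be checked against the table above before extraction. The following is a minimal sketch, not part of the toolkit: the helper names `sha256_of` and `matches_prefix` are hypothetical, and because the manifest lists only the leading characters of each digest, the comparison is a prefix match rather than a full verification.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: Path, expected_prefix: str) -> bool:
    """Check a file's digest against a possibly truncated expected value."""
    # Drop the trailing "..." that marks truncation in the manifest.
    prefix = expected_prefix.rstrip(".").lower()
    if not prefix:
        raise ValueError("expected_prefix must contain at least one hex digit")
    return sha256_of(path).startswith(prefix)


if __name__ == "__main__":
    # Prefixes copied from the Archives table; the full digests are not listed.
    expected = {
        "evaluations.tar.gz": "1682994cea5deff2...",
        "models.tar.gz": "42a8ecc5acb7c8dd...",
        "reference_data.tar.gz": "4d80a7243d4c6ef2...",
        "sboms.tar.gz": "5988e279e9a1db84...",
        "scans.tar.gz": "6de91cbe682c895a...",
    }
    for name, prefix in expected.items():
        archive = Path(name)
        if not archive.exists():
            print(f"{name}: missing")
            continue
        status = "ok" if matches_prefix(archive, prefix) else "MISMATCH"
        print(f"{name}: {status}")
```

A prefix match only rules out gross corruption; compare against the full digest where one is available.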

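## Archive Contents

File counts and sizes in this section describe the extracted contents, so they exceed the compressed archive sizes listed above.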

### sboms.tar.gz

Software Bills of Materials (CycloneDX JSON format)

**Included paths:**
- `data/filtered_sboms/` (5,349 files, 102.6 MB)
- `data/scanned_sboms/` (5,349 files, 209.6 MB)

### scans.tar.gz

Enriched vulnerability scan results

**Included paths:**
- `outputs/scans/` (2,701 files, 187.4 MB)

### models.tar.gz

Trained GNN model checkpoints

**Included paths:**
- `outputs/models/` (7 files, 1.8 MB)

### evaluations.tar.gz

Model evaluation results and metrics

**Included paths:**
- `outputs/evaluations/` (5,421 files, 22.5 MB)

### reference_data.tar.gz

Attack chain data and vulnerability caches

**Included paths:**
- `data/external_chains/` (1 file, 40.0 KB)
- `data/ac_data/` (2 files, 1.2 MB)
- `data/cve_cache/` (136 files, 375.9 KB)
- `data/cwe_cache/` (1 file, 1.7 MB)
- `data/capec_cache/` (1 file, 3.7 MB)

## Data Formats

### SBOM Files
- Format: CycloneDX JSON (spec version 1.5)
- Naming: `{commit_hash}` for files in `data/filtered_sboms/`, `{commit_hash}_enriched` for files in `data/scanned_sboms/`
- Contains: Components, dependencies, vulnerabilities (CVEs), and metadata
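
As an illustration of the structure described above, the sketch below reads one SBOM and reports a few counts. It relies only on the standard CycloneDX JSON top-level fields (`specVersion`, `components`, `dependencies`, `vulnerabilities`); the function name `summarize_sbom` is hypothetical, and the script takes the file path as an argument rather than assuming a particular file name or extension.

```python
import json
import sys
from pathlib import Path


def summarize_sbom(path: Path) -> dict:
    """Return basic counts for a CycloneDX JSON document."""
    bom = json.loads(path.read_text(encoding="utf-8"))
    # All three arrays are optional in CycloneDX, so default to empty lists.
    components = bom.get("components", [])
    dependencies = bom.get("dependencies", [])
    vulnerabilities = bom.get("vulnerabilities", [])
    return {
        "spec_version": bom.get("specVersion"),
        "components": len(components),
        # Each dependency entry lists the refs it depends on under "dependsOn".
        "dependency_edges": sum(len(d.get("dependsOn", [])) for d in dependencies),
        "vulnerability_ids": sorted({v["id"] for v in vulnerabilities if "id" in v}),
    }


if __name__ == "__main__" and len(sys.argv) > 1:
    print(json.dumps(summarize_sbom(Path(sys.argv[1])), indent=2))
```

Filtered SBOMs may carry no `vulnerabilities` array at all, in which case the summary reports an empty list.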

### Scan Results
- Format: JSON
- Contains: Enriched vulnerability data with CVSS scores, CWE mappings, and severity

### Model Checkpoints
- Format: PyTorch (.pt)
- Contains: Model state dict, training configuration, and input dimensions
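
The manifest does not name the keys stored in each checkpoint, so a reasonable first step is to list them. The sketch below assumes each `.pt` file holds a dictionary and that PyTorch is installed; `describe_checkpoint` and `load_checkpoint` are hypothetical helper names, not toolkit functions.

```python
import sys
from pathlib import Path
from typing import Any, Mapping


def describe_checkpoint(checkpoint: Mapping[str, Any]) -> dict:
    """Map each top-level key to the type name of its value."""
    return {key: type(value).__name__ for key, value in checkpoint.items()}


def load_checkpoint(path: Path) -> Mapping[str, Any]:
    """Load a checkpoint onto the CPU (assumes it is a dict-like object)."""
    # Imported lazily so describe_checkpoint works without PyTorch installed.
    import torch

    # weights_only=True restricts unpickling to tensors and plain containers.
    # If the stored configuration uses custom classes, this may raise an error.
    return torch.load(path, map_location="cpu", weights_only=True)


if __name__ == "__main__" and len(sys.argv) > 1:
    for key, type_name in describe_checkpoint(load_checkpoint(Path(sys.argv[1]))).items():
        print(f"{key}: {type_name}")
```

Loading with `map_location="cpu"` avoids requiring a GPU just to inspect the file.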

### Evaluation Results
- Format: JSON and CSV
- Contains: Predictions, ground truth labels, and performance metrics

## Usage

1. Extract archives to the project root:
   ```bash
   for f in *.tar.gz; do tar -xzf "$f" -C /path/to/sbom-toolkit; done
   ```

2. Or use the download script:
   ```bash
   uv run python scripts/download_data.py
   ```

## Citation

If you use this dataset, please cite:

```bibtex
@misc{sbom_toolkit_dataset,
  author = {Baird, Laura},
  title = {SBOM Toolkit: Software Bill of Materials Dataset for GNN-based Vulnerability Prediction},
  year = {2025},
  publisher = {Harvard Dataverse},
  doi = {10.xxxx/DVN/XXXXXX}
}
```

## License

This dataset is released under the MIT License.
